Training (add tensorboard debug, and mAP Calculation) #206

KUASWoodyLIN · 2018-08-06T14:00:37Z

Provide useful debug information on tensorboard

mAP scalars

Images

Distributions

Histograms

chenyuqing

/home/tim/anaconda3/bin/python /home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py
/home/tim/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
/home/tim/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
2018-09-22 14:30:49.472148: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-09-22 14:30:49.562588: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-22 14:30:49.562875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce MX150 major: 6 minor: 1 memoryClockRate(GHz): 1.5315
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.36GiB
2018-09-22 14:30:49.562886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce MX150, pci bus id: 0000:01:00.0, compute capability: 6.1)
Create YOLOv3 model with 9 anchors and 2 classes.
Traceback (most recent call last):
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 686, in _call_cpp_shape_fn_impl
input_tensors_as_shapes, status)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 1 and 255 for 'Assign_360' (op: 'Assign') with input shapes: [1,1,1024,21], [255,1024,1,1].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py", line 529, in <module>
yolo = Yolo()
File "/home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py", line 73, in __init__
self.yolo_model = self.create_model(yolo_weights_path='model_data/yolo_weights.h5')
File "/home/tim/workspaces_wx/keras-yolo3/voc_train_eval.py", line 117, in create_model
model_body.load_weights(yolo_weights_path, skip_mismatch=True)
File "/home/tim/anaconda3/lib/python3.6/site-packages/keras/engine/network.py", line 1161, in load_weights
f, self.layers, reshape=reshape)
File "/home/tim/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py", line 928, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "/home/tim/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2435, in batch_set_value
assign_op = x.assign(assign_placeholder)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 573, in assign
return state_ops.assign(self._variable, value, use_locking=use_locking)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/state_ops.py", line 276, in assign
validate_shape=validate_shape)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_state_ops.py", line 57, in assign
use_locking=use_locking, name=name)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2958, in create_op
set_shapes_for_outputs(ret)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2209, in set_shapes_for_outputs
shapes = shape_func(op)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2159, in call_with_requiring
return call_cpp_shape_fn(op, require_shape_fn=True)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 627, in call_cpp_shape_fn
require_shape_fn)
File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 691, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 255 for 'Assign_360' (op: 'Assign') with input shapes: [1,1,1024,21], [255,1024,1,1].

Process finished with exit code 1

chenyuqing · 2018-09-22T06:43:36Z

Could you help me see what is wrong with my code? Thanks!
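
One possible reading of the two shapes in the error above: each YOLOv3 detection head emits 3 anchors × (number of classes + 5) channels, so 21 matches the 2-class model being created and 255 matches 80-class weights. The sketch below only does that arithmetic; `head_channels` is a hypothetical helper for illustration, not a function from the repository.

```python
def head_channels(num_classes, anchors_per_scale=3):
    """Channels of one YOLOv3 detection head.

    Per anchor: 4 box coordinates + 1 objectness score + one score per class.
    """
    return anchors_per_scale * (num_classes + 5)


print(head_channels(2))   # 21, the 2-class model in the log
print(head_channels(80))  # 255, an 80-class weights file
```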

KUASWoodyLIN · 2018-09-25T04:37:43Z

Hi @chenyuqing

I have tried train_v2.py, and it looks fine.
Maybe you should check the Keras backend configuration file (link);
I think your "image_data_format" is not correct.
My settings are:

{
    "epsilon": 1e-07,
    "image_data_format": "channels_last",
    "floatx": "float32",
    "backend": "tensorflow"
}

Make sure your settings are the same as mine.
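
A minimal sketch for checking that setting from Python, assuming the configuration lives at the default Keras location `~/.keras/keras.json`. The helper name `read_image_data_format` is made up for this example.

```python
import json
import os


def read_image_data_format(config_path=None):
    """Return the "image_data_format" value from a Keras config file, or None."""
    if config_path is None:
        # Assumption: the default per-user Keras config location.
        config_path = os.path.join(os.path.expanduser("~"), ".keras", "keras.json")
    if not os.path.exists(config_path):
        return None
    with open(config_path) as f:
        return json.load(f).get("image_data_format")


# Expect "channels_last" to match the settings shown above.
print(read_image_data_format())
```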

Borda · 2019-08-15T09:31:50Z

Hi, it seems that this repo has been inactive for a while... (more than a year 😟)
Would you consider submitting your changes to this fork: https://github.com/Borda/keras-yolo3 ?

shocora · 2020-10-28T02:21:43Z

Hi! I'm having trouble because I can't get training to run. Which part of train_v2.py should I change to run it?

tfukumori · 2020-10-28T03:24:20Z

> Hi! I'm having trouble because I can't get training to run. Which part of train_v2.py should I change to run it?

It seems that the versions of Python, TensorFlow, and Keras are important.

You can find the following description in the repository
https://github.com/qqwweee/keras-yolo3

Python 3.5.2
Keras 2.1.5
tensorflow 1.6.0

I have also verified that it works with the following environment:

Python 3.6
Keras 2.2.4
tensorflow 1.14.0
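
As a rough sketch, the two combinations above can be written down as a table and compared against an installed environment. `KNOWN_GOOD` and `is_known_good` are hypothetical names; the table contains only the versions reported in this thread and says nothing about other combinations.

```python
# Version combinations reported in this thread (Python matched on major.minor only).
KNOWN_GOOD = [
    {"python": "3.5", "keras": "2.1.5", "tensorflow": "1.6.0"},
    {"python": "3.6", "keras": "2.2.4", "tensorflow": "1.14.0"},
]


def is_known_good(python, keras, tensorflow):
    """Return True if the given versions match one of the reported combinations."""
    py_minor = ".".join(python.split(".")[:2])
    return any(
        combo["python"] == py_minor
        and combo["keras"] == keras
        and combo["tensorflow"] == tensorflow
        for combo in KNOWN_GOOD
    )


print(is_known_good("3.6.9", "2.2.4", "1.14.0"))  # True
print(is_known_good("3.8.0", "2.2.4", "1.14.0"))  # False
```

In a real environment the arguments could come from `platform.python_version()`, `keras.__version__`, and `tensorflow.__version__`.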

shocora · 2020-10-29T14:06:29Z

Thank you for your reply. I matched the versions, but it still doesn't work.
Is there anywhere the PATH needs to be changed other than lines 34, 35 and 41, 42 of train_v2.py? Also, is it okay if LOGS_PATH is empty the first time I train?

tfukumori · 2020-10-29T23:31:07Z

When I first trained it, it didn't matter that the LOGS_PATH folder (the default is yolo_logs) was empty.

I ran the following command:

python train_v2.py --yolo_train_file 2007_train.txt --yolo_val_file nano

2007_train.txt was created using voc_annotation.py, as described at https://github.com/qqwweee/keras-yolo3

To check the training results, I ran the following commands:

conda install tensorboard -y
tensorboard --logdir=<yolo_logs' full path> --host 0.0.0.0

In a web browser, go to http://localhost:6006/

shocora · 2020-11-10T12:45:15Z

Sorry for the late reply. When I ran the above command as is, I got the following error.
Also, what does nano specify?
"File "train_v2.py", line 89, in __init__
images_choose = [self.val_images[i] for i in np.random.randint(0, len(self.val_images), 50)]
AttributeError: 'Yolo' object has no attribute 'val_images'"

tfukumori · 2020-11-10T22:55:00Z

@shocora

What "nano" specifies

Specify "nano" as the validation file if you do not have a validation file or if it does not exist.

As you can see around the following line of code, train_v2.py splits the data into training and validation sets in a 9:1 ratio, regardless of whether a validation file is specified.

keras-yolo3/train_v2.py

Line 439 in f4a9c40

if not self.val_annotation_path == 'nano':
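
For illustration only, a 9:1 split of annotation lines could look like the sketch below. This is not the code from train_v2.py; `split_train_val`, the seed, and the sample annotation lines are assumptions made for the example.

```python
import random


def split_train_val(lines, val_ratio=0.1, seed=0):
    """Shuffle annotation lines and hold out val_ratio of them for validation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    num_val = int(len(shuffled) * val_ratio)
    # First num_val shuffled lines become validation, the rest training.
    return shuffled[num_val:], shuffled[:num_val]


# Placeholder annotation lines, one image per line.
lines = ["img%d.jpg 0,0,10,10,0" % i for i in range(100)]
train, val = split_train_val(lines)
print(len(train), len(val))  # 90 10
```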

Why the error occurred

I think the following line of code was not executed because the folder for temporary files had been left behind, for example after an abnormal exit.

keras-yolo3/train_v2.py

Line 82 in f4a9c40

self.train_data, self.val_data, self.val_images = self.read_txt_file()

Procedure before executing the command

Before executing the command, delete any leftover temporary folder and move the results folder out of the way:

If a working folder (tmp_*) remains due to interruption, delete it.
If the results folder (yolo_logs) remains, delete, rename or move it.
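
The two steps above can be sketched as a small helper. It assumes the working folders match `tmp_*` and the results folder is named `yolo_logs`, as described in this thread; `prepare_run` and the timestamped backup name are choices made for this example, not part of train_v2.py.

```python
import shutil
import time
from pathlib import Path


def prepare_run(work_dir=".", logs_name="yolo_logs"):
    """Delete leftover tmp_* folders and move an existing logs folder aside."""
    work_dir = Path(work_dir)
    removed = []
    for tmp in work_dir.glob("tmp_*"):
        if tmp.is_dir():
            shutil.rmtree(tmp)  # leftover working folder from an interrupted run
            removed.append(tmp.name)
    logs = work_dir / logs_name
    moved_to = None
    if logs.exists():
        # Keep the old results under a timestamped name instead of deleting them.
        stamp = time.strftime("%Y%m%d_%H%M%S")
        moved_to = work_dir / "{}_{}".format(logs_name, stamp)
        shutil.move(str(logs), str(moved_to))
    return sorted(removed), moved_to
```

Calling `prepare_run()` from the repository folder before `python train_v2.py ...` would then start from a clean state.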

shocora · 2020-11-11T03:15:56Z

@tfukumori
Thank you.
I was able to start training.
However, it stopped with the following error.

"""
549 [Yolo loss: 36.249851]
Testing ...
[Yolo testing loss: 38.217436981201175]
Evaluate mAP
2020-11-11 11:59:25.702287: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-11-11 11:59:25.702377: E tensorflow/stream_executor/cuda/cuda_driver.cc:1032] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure ::
2020-11-11 11:59:25.702561: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status:
"""

Also, how can I adjust the values on the horizontal axis of the training loss graph?

tfukumori · 2020-11-11T03:53:22Z

"CUDA" error.

I'm not sure about the "CUDA" error.

From the error message, it is possible that the GPU is not powerful enough, but I'm not sure.

If it's due to a lack of GPU performance, then running it on the CPU or reducing the batch size might solve the problem. (Either option comes at a cost in training speed.)

https://jp.mathworks.com/matlabcentral/answers/427234-what-is-the-cause-of-cuda_error_launch_failed
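
A minimal sketch of the CPU fallback mentioned above, assuming the usual approach of hiding CUDA devices through the `CUDA_VISIBLE_DEVICES` environment variable. Whether this resolves the error in this case is not confirmed in the thread.

```python
import os

# Hide all CUDA devices from TensorFlow so that it falls back to the CPU.
# This has to run before TensorFlow (or Keras) is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Import tensorflow / keras only after this point,
# for example at the very top of the training script.
```

The other option, a smaller batch size, would mean lowering the batch-size settings that appear in the code excerpts in this thread (`step1_batch_size`, `step2_batch_size`).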

Adjust the value on the horizontal axis of training loss

If you mean changing the settings of the graph, I don't know how.

If you mean the value that the code calls epoch, it is computed from the number of training images and the batch size:

keras-yolo3/train_v2.py

Line 149 in f4a9c40

epoch = len(self.train_data) // self.step1_batch_size

keras-yolo3/train_v2.py

Line 199 in f4a9c40

epoch = len(self.train_data) // self.step2_batch_size
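
As a worked example of the two lines quoted above, the integer division gives the following values. The sample counts and batch sizes are made-up numbers, and `steps_from_batch_size` is a hypothetical name.

```python
def steps_from_batch_size(num_train_samples, batch_size):
    """Mirror of the quoted expression: len(train_data) // batch_size."""
    return num_train_samples // batch_size


print(steps_from_batch_size(1000, 8))   # 125
print(steps_from_batch_size(1000, 32))  # 31
```

A larger batch size therefore yields fewer steps for the same number of images.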

shocora · 2020-11-16T15:04:29Z

@tfukumori
I was able to finish training in 3 days. Thanks.

Why is "tmp_pred_files" empty before and after training?

Also, when running yolo.py on full HD input, is it better to change the following numbers?

keras-yolo3/yolo.py

Line 28 in e6598d1

"model_image_size" : (416, 416),

shocora · 2020-11-30T11:21:14Z

I think mAP is usually between 0 and 1, but I get a value greater than or equal to 1.
I would appreciate it if you could tell me the cause.

tfukumori · 2020-12-08T08:27:52Z

> I think mAP is usually between 0 and 1, but I get a value greater than or equal to 1.
> I would appreciate it if you could tell me the cause.

Maybe that's because the value is multiplied by 100, as you can see below.

mAP * 100

https://github.com/qqwweee/keras-yolo3/blob/f4a9c40f4615cdbb774942507ecad3af5f05c990/train_v2.py#L419
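
To illustrate the scaling, here is a small sketch that averages per-class AP values in the range 0 to 1 and optionally reports the result as a percentage. The function name and the sample AP values are assumptions; the actual computation at the linked line may differ in detail.

```python
def mean_average_precision(per_class_ap, as_percent=True):
    """Average per-class AP values (each in [0, 1]); optionally scale by 100."""
    mean_ap = sum(per_class_ap) / len(per_class_ap)
    return mean_ap * 100 if as_percent else mean_ap


print(mean_average_precision([0.5, 0.75]))                    # 62.5
print(mean_average_precision([0.5, 0.75], as_percent=False))  # 0.625
```

Dividing a reported value by 100 recovers the usual 0 to 1 range.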

shocora · 2020-12-08T14:02:34Z

Is this number the result of multiplying by 100?
Also, what standard does the mAP calculation used here follow?

tfukumori · 2020-12-08T14:52:55Z

> Is this number the result of multiplying by 100?
> Also, what standard does the mAP calculation used here follow?

I think this will be helpful.

You can find it here: https://qiita.com/mdo4nt6n/items/08e11426e2fac8433fed

KUASWoodyLIN added 5 commits August 6, 2018 21:45

Add train_v2.py(tensorboard debug tool and mAP calculation)

58728f1

fix mAP function

eecb1bf

Delete workspace.xml

1ebbb0d

add weights saveing rule, fix mAP bug

8b99e56

fix merge

98dd2d8

chenyuqing reviewed Sep 22, 2018


fix mAP functions

f4a9c40

jiayunhan approved these changes Oct 5, 2018


zuoxiang95 approved these changes Oct 18, 2018


KUASWoodyLIN mentioned this pull request Oct 18, 2018

yolo v3中文交流 #254

Open

tfukumori mentioned this pull request Aug 20, 2020

mAP Tensorboard metric #492

Open

tfukumori added a commit to tfukumori/keras-yolo3 that referenced this pull request Sep 22, 2020

Apply pull request. qqwweee#206

bfe5d95

tfukumori added a commit to tfukumori/keras-yolo3 that referenced this pull request Sep 22, 2020

Change the next pull request to yolov3-tiny support. qqwweee#206

564cce2

tfukumori mentioned this pull request Oct 8, 2020

no code to get maP for results #731

Open
